Study of entity-topic models for OOV proper name retrieval
Abstract
Retrieving Proper Names (PNs) relevant to an audio document can improve speech recognition and content-based audio-video indexing. The Latent Dirichlet Allocation (LDA) topic model has been used to retrieve Out-Of-Vocabulary (OOV) PNs relevant to an audio document with good recall rates. However, retrieval of OOV PNs using LDA is affected by two issues, which we study in this paper: (1) Word Frequency Bias (less frequent OOV PNs are ranked lower); (2) Loss of Specificity (the reduced topic space representation loses lexical context). Entity-Topic models have been proposed as extensions of LDA to specifically learn relations between words, entities (PNs) and topics. We study OOV PN retrieval with Entity-Topic models and show that they are also affected by word frequency bias and loss of specificity. We evaluate our proposed methods for rare OOV PN re-ranking and lexical context re-ranking for LDA as well as for Entity-Topic models. The results show an improvement in both Recall and Mean Average Precision.
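The word frequency bias described above can be illustrated with a minimal sketch. It assumes a standard LDA-style score, p(PN | document) = Σ_t p(PN | t) · p(t | document), which may differ from the exact ranking function used in the paper. The function name `rank_proper_names`, the PN labels, and all probabilities are hypothetical toy values, not results from the paper.

```python
# Toy illustration of word frequency bias in topic-based PN ranking.
# All names and numbers are hypothetical, not taken from the paper.

def rank_proper_names(pn_given_topic, topic_given_doc):
    """Rank candidate PNs by sum_t p(pn | t) * p(t | doc), highest first."""
    scores = {
        pn: sum(p_pn * p_t for p_pn, p_t in zip(probs, topic_given_doc))
        for pn, probs in pn_given_topic.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Assumed topic mixture inferred for one audio document (two topics).
topic_given_doc = [0.9, 0.1]

# Assumed p(pn | topic). "rare_name" appears only in the document's
# dominant topic but is infrequent in training data, so its probability
# is small. "frequent_name" is common overall.
pn_given_topic = {
    "frequent_name": [0.010, 0.050],
    "rare_name": [0.004, 0.000],
}

ranking = rank_proper_names(pn_given_topic, topic_given_doc)
for pn, score in ranking:
    print(f"{pn}: {score:.4f}")
```

In this toy setup the rare name is tied exclusively to the document's dominant topic, yet it still scores below the frequent name. That ordering is the kind of effect a rare-PN re-ranking step would aim to correct.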
Similar resources
How Diachronic Text Corpora Affect Context based Retrieval of OOV Proper Names for Audio News
Out-Of-Vocabulary (OOV) words missed by Large Vocabulary Continuous Speech Recognition (LVCSR) systems can be recovered with the help of topic and semantic context of the OOV words captured from a diachronic text corpus. In this paper we investigate how the choice of documents for the diachronic text corpora affects the retrieval of OOV Proper Names (PNs) relevant to an audio document. We first...
Effective Transliteration
The translation of texts written in different languages is required in many domains, such as machine translation and cross-lingual information retrieval. Translating words of a text from a source language into a different target language can be efficiently achieved using a bilingual vocabulary, where every source word has a counterpart in the target language. In practice, however, there are oft...
Speech Recognition of Foreign Out-o Hierarchical Lang
This paper proposes a new speech recognition scheme for foreign out-of-vocabulary words embedded in native-language speech. To recognize foreign names frequently observed in news speech or in translation speech, we adopted a hierarchical language model that had been successfully applied to OOV words covering native vocabularies. In this hierarchical language model, OOV vocabularies are modeled ...
Indonesian-Japanese CLIR Using Only Limited Resource
Our research aim here is to build a CLIR system that works for a language pair with poor resources where the source language (e.g. Indonesian) has limited language resources. Our Indonesian-Japanese CLIR system employs the existing Japanese IR system, and we focus our research on the Indonesian-Japanese query translation. There are two problems in our limited resource query translation: the OOV p...
Coling • Acl 2006
Our research aim here is to build a CLIR system that works for a language pair with poor resources where the source language (e.g. Indonesian) has limited language resources. Our Indonesian-Japanese CLIR system employs the existing Japanese IR system, and we focus our research on the Indonesian-Japanese query translation. There are two problems in our limited resource query translation: the OOV p...